Clindex: Clustering for Similarity Queries in High-Dimensional Spaces
Abstract
In this paper we present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate searches, where one wishes to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few IOs and performs significantly better than other approaches. Our scheme is based on finding clusters, and then building a simple but efficient index for them. We analyze the tradeoffs involved in clustering and building such an index structure, and present experimental results based on a 30,000-image database.
Similar resources
Clustering for Approximate Similarity Search in High-Dimensional Spaces
In this paper we present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one wishes to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few IOs and perform sign...
DAHC-tree: An Effective Index for Approximate Search in High-Dimensional Metric Spaces
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. A problem regarding the creation of suitable index structures for high-dimensional data is the relationship between the geometry o...
CSVD: Clustering and Singular Value Decomposition for Approximate Similarity Search in High-Dimensional Spaces
High-dimensionality indexing of feature spaces is critical for many data-intensive applications such as content-based retrieval of images or video from multimedia databases and similarity retrieval of patterns in data mining. Unfortunately, even with the aid of the commonly-used indexing schemes, the performance of nearest neighbor (NN) queries (required for similarity search) deteriorates rapi...
CoFD: An Algorithm for Non-distance Based Clustering in High Dimensional Spaces
The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high-dimensional spaces. In this paper, we propose the CoFD algorithm, which is a non-dis...
Using the Distance Distribution for Approximate Similarity Queries in High-Dimensional Metric Spaces
We investigate the problem of approximate similarity (nearest neighbor) search in high-dimensional metric spaces, and describe how the distance distribution of the query object can be exploited so as to provide probabilistic guarantees on the quality of the result. This leads to a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming...